Exploration of very large data sets: The CiTree algorithm
Abstract
Control variables and command history recorded during normal operation of industrial processes are routinely stored in databases for later analysis. These databases constitute a potentially valuable source of information from a commercial and strategic perspective. However, extracting information from a database is often a non-trivial task that requires the cooperation of various disciplines, in what has become a field of research in its own right, known as ‘data mining’. One of the first tasks for the data miner is to summarize the complexity of the data into a number of distinct clusters, which represent ‘interesting’, often unexpected, behavior patterns of the process under analysis. Powerful computers and efficient clustering algorithms are now available; nonetheless, their limits are typically exceeded when mining massive databases such as those arising from industrial processes. Therefore, to uncover the potentially useful information held by large bodies of data, new clustering algorithms are needed that directly address the problem of size. We present here one such algorithm, which has been successfully applied to a variety of industrial processes. It yields a hierarchical structure of the clusters present in the process, thus providing a detailed representation of the relationships amongst sample units. We also show, as an example, the application of the algorithm to a real case study, which resulted in the extraction of useful information.
Key-Words: Data Mining, Optimization, Model, Industrial, Clustering.
Similar resources
Determining the factors affecting the incidence of gastric cancer using a data mining approach
Background and Aim: Gastric cancer is the second leading cause of cancer death in the world. Given the prevalence of the disease and the high mortality rate of gastric cancer in Iran, the factors affecting its development should be taken into account. In this research, two data mining techniques, the Apriori and ID3 algorithms, were used in order to investigate the effective f...
Knowledge discovery in rubber extrusion processes
This paper describes the outcomes of a study that the EDMANS group has recently performed on a rubber extrusion process, focusing on the knowledge discovery phase that precedes system modeling. Some of the tools developed to satisfy the special needs of such a process are also presented: the CiTree algorithm for clustering subpopulations in massive databases and the PAELLA algorithm for o...
An Incremental DC Algorithm for the Minimum Sum-of-Squares Clustering
Here, an algorithm is presented for solving minimum sum-of-squares clustering problems using their difference-of-convex (DC) representations. The proposed algorithm is based on an incremental approach and applies the well-known DC algorithm at each iteration. It is tested and compared with other clustering algorithms using large real-world data sets.
Bioinformatics to Biostochastics: Statistical Perspectives and Tasks Ahead
Bioinformatics is an emerging field of science emphasizing the application of mathematics, statistics, and informatics to the study and analysis of very large molecular biological (mostly genetic and genomic) systems (data sets). In a comparatively broader setup of large biological systems without necessarily having a predominant genetic undercurrent, and having genesis in biometry to biostatistic...
AN-EUL method for automatic interpretation of potential field data in unexploded ordnances (UXO) detection
We have applied an automatic interpretation method for potential field data, called AN-EUL, to unexploded ordnance (UXO) prospecting. The method is a combination of the analytic signal and Euler deconvolution approaches. It can be applied to both magnetic and gravity data, as well as to gradient surveys, based upon the concept of the structural index (SI) of a potential anomaly which is relate...